Co-Design Method Enables Speech-Recognition SoC

By Lorenzo Cali, Francesco Lertora, Monica Besana and Michele Borgatti
Integrated System Design
Posted 11/06/01, 05:27:00 PM EDT

Ambiguous or incorrect specifications are a major cause of failure in systems design. It is therefore critical for the system designer to capture system specifications in an executable, unambiguous format. Doing so lets the customer, the system provider and the various design teams reach agreement early in a concurrent hardware/software design process. In a pilot project for the design of an SoC-based speech-recognition system, STMicroelectronics implemented a system-level design flow based on the CoWare N2C ("napkin-to-chip") methodology.

Our first prototype system comprised a voice-activated controller with built-in speech-recognition capabilities and implemented a novel, fully hardwired, ST-proprietary algorithm. That design was based on a 0.25-micron ASIC and used external nonvolatile memory to store word templates and parameters. Although this first prototype consumed very little power and achieved significant recognition performance, it could not be mapped easily into a single system-on-chip (SoC), largely because it offered no flexibility: control state machines and data paths, for example, were fully hardwired.

In our new design, we maintained the earlier functionality but parameterized functions that had previously been hardwired for a single purpose, making them more flexible. Some are implemented in hardware, others in software. To improve the flexibility of the recognition system, the main processing functions are organized as hardware coprocessors integrated around an ARM7TDMI 32-bit processor and a 2-Mbit embedded flash memory. This latest SoC also includes an AMBA APB (advanced peripheral bus) and a number of peripherals, timers, interrupt controllers and other associated functions.


Two engineers completed the system-to-RTL flow in six months. Concurrently, two software engineers completed the software development flow, also in six months. The overall effort was therefore two man-years. Our speech-recognition SoC achieved a successful tapeout and was 100 percent functional on the first pass. The chip includes 300,000 logic gates that account for about 5 million transistors.

In the design we paid particular attention to the speech-recognizer architecture. The recognizer has a preprocessor that provides data for both the end point detection (EPD) process and the autocorrelation function (ACF) computation filters. EPD performs the silence/speech discrimination, while the cascade of the ACF and linear predictive cepstrum (LPC) blocks translates each incoming word, a sequence of speech samples, into a variable-length sequence of cepstrum feature vectors (see the Picone article cited in the references).
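As a rough illustration of the last step in that front end, the sketch below converts a set of linear-prediction coefficients into cepstrum coefficients using the standard textbook recursion for an all-pole model. It is not the ST-proprietary algorithm; the function name `lpc_to_cepstrum`, the sign convention and the number of coefficients are assumptions made for this example.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC coefficients to cepstrum coefficients c[1..n_ceps].

    Assumed convention: A(z) = 1 + sum_{k=1..p} a[k] z^-k with a[0] == 1.0,
    and the cepstrum is that of the all-pole model 1 / A(z).
    """
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)  # c[0] (gain term) is not computed here
    for n in range(1, n_ceps + 1):
        # Only terms with n - k <= p contribute, since a[j] = 0 for j > p.
        acc = sum(k * c[k] * a[n - k] for k in range(max(1, n - p), n))
        a_n = a[n] if n <= p else 0.0
        c[n] = -a_n - acc / n
    return c[1:]
```

For a single-pole model with `a = [1.0, -0.5]`, the recursion reproduces the closed-form cepstrum `0.5**n / n`, which is a convenient sanity check.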

The compressor block transforms the resulting cepstrum vectors into a suitable memory structure, which is then stored in the word RAM. An on-chip flash memory, which stores the database used during the recognition phase, is tightly coupled to a dynamic time warp (DTW) distance-calculation coprocessor. The DTW engine is composed of two blocks, the DTW outer loop and the DTW inner loop, which pass the result of the recognition process to the norm-and-voting-rule block. That block calculates a normalized distance between the acquired word and each word stored in the recognition database. The best match between the acquired word and the stored words is then supplied to the application according to a chosen voting rule.
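The matching stage described above can be sketched in software as follows. This is a minimal textbook DTW with a nested outer/inner loop, followed by a length normalization and a simple "smallest normalized distance wins" rule. The actual normalization and voting rule used on the chip are not given in the article, so `recognize`, the normalization by summed lengths and the template dictionary are all assumptions for illustration.

```python
import math

def dtw_distance(word, template):
    """Accumulated DTW distance between two sequences of feature vectors."""
    n, m = len(word), len(template)
    # cost[i][j] = best accumulated distance aligning word[:i] with template[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):        # outer loop: frames of the acquired word
        for j in range(1, m + 1):    # inner loop: frames of the stored template
            d = math.dist(word[i - 1], template[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]

def recognize(word, templates):
    """Return (label, normalized distance) of the closest stored word.

    Assumed voting rule: nearest template after dividing by the summed lengths.
    """
    best = None
    for label, template in templates.items():
        norm = dtw_distance(word, template) / (len(word) + len(template))
        if best is None or norm < best[1]:
            best = (label, norm)
    return best
```

A call such as `recognize(acquired_vectors, {"yes": yes_vectors, "no": no_vectors})` returns the label whose template warps onto the acquired word at the lowest normalized cost.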

We had several main project goals for implementing the design in an SoC:

  • Investigate a top-down design methodology by exploring hardware/software co-design and architecture choices.
  • Achieve first-time design success through a design methodology that synthesizes correct-by-construction hardware/software interfaces.
  • Design reusable and socketized versions of highly parametric functional blocks (functional and cycle-true encapsulations).
  • Exploit architectures to drive the specifications for embedded flash memories for use in a signal-processing context.
  • Perform hardware/software co-verification and assess the eventual use of hardware emulation for RTL sign-off.

    A major design challenge was to ensure real-time operation. Once we had fixed a set of system parameters allowing high-performance speech recognition, we had to make sure that software and hardware interactions worked at a reasonable operating frequency. Keeping the operating frequency as low as possible was also crucial, since our SoC design is targeted at low-power applications.

    We inherited from previous projects a full C/C++ description of the system function, testbenches and a handful of complex intellectual-property blocks described in VHDL-RTL. In parallel, we completely redesigned the main IP blocks devoted to speech processing and generalized them as highly parametric IP for compile- or run-time architectural parameterization. Parameterization defines functional, architectural and microarchitectural characteristics of the IP blocks.


    We developed a functional model in C/C++ of the speech-recognition application starting from the original code. This phase of the project included a significant generalization of the functional blocks of the system, since the original specification was written with a fully hardwired implementation in mind. This generalization was intended to give more flexibility for tuning the recognition performance in the final application, but also to allow full architectural exploration.

    A major activity was the functional modeling of the available VHDL IP blocks using the CoWare-C language. Both pure-functional (untimed) C models and encapsulations of VHDL models were provided. Communication details were removed in functional models to completely separate functionality and communication. This made it possible to quickly evaluate different hardware/software partitions of the system using interface synthesis.

    We evaluated the system-executable specification in a multilevel simulation through top-down model refinement. First, we used untimed simulation to verify that the system specification fit into the application testbench, and to verify the consistency between the implemented communication mechanisms and tool communication semantics. This part of the design, both hardware and software, was performed in CoWare-C (simulated in an N2C environment); hence, differentiation between hardware and software was not needed in the beginning. All blocks were connected by means of communication primitives, allowing simulation of system behavior as it would eventually be implemented.

    With this untimed C model, we carried out in-depth partitioning to better understand system bottlenecks. It became clear that some functions would be difficult to implement in software with high performance. For example, we had to use hardware acceleration for the autocorrelation function (many concurrent sums of products) and for the dynamic time warp inner loop (a considerable number of sums and multiplications are required during a recognition phase).
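To make the "sums of products" workload concrete, here is a minimal sketch of a short-time autocorrelation together with a count of the multiply-accumulate operations it needs per frame. The frame length and maximum lag used in the note below are illustrative values only, not parameters taken from the design.

```python
def autocorrelation(frame, max_lag):
    """r[k] = sum over i of frame[i] * frame[i + k], for k = 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]

def mac_count(frame_len, max_lag):
    """Multiply-accumulate operations needed for one frame."""
    return sum(frame_len - k for k in range(max_lag + 1))
```

With an assumed 256-sample frame and 10 lags, `mac_count(256, 10)` gives 2,761 multiply-accumulates per frame. Each lag is independent of the others, which is why this kind of function maps naturally onto parallel hardware.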

    The next design phase was to integrate the main hardware blocks with the rest of the system. In this context we decided to leave both the Levinson-Durbin recursion algorithm (LPC analysis has an irregular execution flow) and the EPD in software, to achieve a good trade-off between flexibility and hardware performance. During this phase, CoWare was very useful in helping to implement such chip functions as hardware-to-software and software-to-hardware communication, interrupt management and others.
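For readers unfamiliar with it, the Levinson-Durbin recursion can be sketched as below. This is the standard textbook form, not the project's code; the sign convention and the early exit on zero prediction error are choices made for this example. The data-dependent division and the inner loop whose length grows with each iteration hint at why such a routine is a more natural fit for a processor than for a fixed data path.

```python
def levinson_durbin(r, order):
    """Solve for LPC coefficients from autocorrelation values r[0..order].

    Assumed convention: prediction-error filter A(z) = 1 + sum a[k] z^-k.
    Returns (a, err) with a[0] == 1.0 and err the final prediction error.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err == 0.0:          # silent or perfectly predictable frame
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err          # reflection coefficient for this order
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err
```

Feeding it the autocorrelation of a first-order process, `levinson_durbin([1.0, 0.5, 0.25], 2)`, yields a first coefficient of -0.5 and a second coefficient of zero, as expected.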

    We then evaluated different system hardware and software functions to determine how long it takes to execute the software path. This intermediate-level simulation (BCASH) is cycle-true for software execution and untimed for hardware. It is used to evaluate the performance of a candidate system hardware/software partition early.

    BCASH is cycle-true in the software path but abstracted in the hardware path. You assume that all hardware executes with zero delay, while the software path takes the actual number of clock cycles that it will take in the final implementation. Going from the untimed C-description simulation to BCASH happens very quickly, making it an ideal way to drive architectural exploration. By moving quickly from the pure C description to a partition that provides real software-execution timing numbers, you can evaluate a large number of partitions, and select among them, in a short amount of time.
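The idea behind this kind of exploration can be mimicked with a toy estimator: charge each function its software cycle count when it runs on the processor, charge nothing when it is mapped to hardware, and derive the minimum clock frequency that still meets the frame deadline. Every number and name below (`SW_CYCLES`, `FRAMES_PER_SECOND`, the block names) is hypothetical and chosen only to show the mechanics; none comes from the actual design or from the N2C tool.

```python
from itertools import product

# Hypothetical software cycle counts per frame; not measurements from the article.
SW_CYCLES = {"acf": 60_000, "lpc": 8_000, "epd": 2_000, "dtw_inner": 150_000}
FRAMES_PER_SECOND = 100  # assumed frame rate

def sw_cycles_per_frame(hw_blocks):
    """Hardware blocks cost zero cycles; software blocks cost their cycle count."""
    return sum(c for name, c in SW_CYCLES.items() if name not in hw_blocks)

def min_clock_hz(hw_blocks):
    """Lowest processor clock that finishes the software path within one frame."""
    return sw_cycles_per_frame(hw_blocks) * FRAMES_PER_SECOND

def explore():
    """Enumerate every hardware/software partition, cheapest clock first."""
    names = sorted(SW_CYCLES)
    results = []
    for mask in product([False, True], repeat=len(names)):
        hw = frozenset(n for n, in_hw in zip(names, mask) if in_hw)
        results.append((min_clock_hz(hw), hw))
    return sorted(results, key=lambda item: item[0])
```

Under these made-up figures, moving `acf` and `dtw_inner` to hardware lowers the required clock from 22 MHz to 1 MHz. A real flow would also weigh gate count and power for each hardware block, which this sketch ignores.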

    The following is an example of an interaction between hardware (HW) and software (SW) during the verification process. The preprocessor in this design (HW) computes the sample energy and sends this value to the ARM7 core by interrupt. The interrupt wakes up the end point detection routine (SW), which executes its algorithm to determine whether a word is present in the acquired samples. When a word is detected, a software routine starts the acquisition control logic (HW), which reads the samples and stores them in a buffer with a fixed timing.

    The autocorrelator (HW) then reads the buffer every frame (a frame is a user-defined speech segment) and stores the autocorrelation vectors in another buffer. For each frame, an interrupt generated by logic inside the autocorrelator signals to the core that the new autocorrelation values are ready. At this point, the interrupt handler (SW) reads this buffer and passes those values to the LPC function.
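The first half of that exchange, energy interrupts driving the end point detector, can be modeled in a few lines. The detector below is a deliberately simple threshold-and-count scheme standing in for the real EPD algorithm, which the article does not describe; `EndPointDetector`, `threshold` and `min_frames` are hypothetical names and parameters.

```python
def frame_energy(samples):
    """Hardware side (preprocessor): energy of one frame of samples."""
    return sum(s * s for s in samples)

class EndPointDetector:
    """Software side: runs once per energy interrupt."""

    def __init__(self, threshold, min_frames):
        self.threshold = threshold    # assumed energy threshold
        self.min_frames = min_frames  # consecutive loud frames needed to start
        self.run = 0
        self.word_active = False

    def on_energy_interrupt(self, energy):
        self.run = self.run + 1 if energy > self.threshold else 0
        if not self.word_active and self.run >= self.min_frames:
            # In the real system, software would start the acquisition logic here.
            self.word_active = True
        elif self.word_active and self.run == 0:
            self.word_active = False
        return self.word_active

def simulate(frames, epd):
    """Deliver one interrupt per frame and record the detector's decision."""
    return [epd.on_energy_interrupt(frame_energy(f)) for f in frames]
```

Calling `simulate` on a short list of frames shows the detector switching on after `min_frames` consecutive loud frames and switching off at the first quiet one.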

    Detailed bus-cycle-accurate (BCA) simulation is used next to sign off the performance of the final system configuration. BCA ensures that your simulation is accurate in terms of the number of clock cycles for both hardware and software paths. At this stage of development, detailing the functional blocks belonging to the hardware partition is equivalent to VHDL/Verilog coding. The main voice-processing modules were already available as parametric VHDL IP, so recoding them in Register Transfer C (RTC), a proprietary CoWare language, was not necessary.

    At this stage any changes to the partition are extremely painful. That's because the intense VHDL work already performed has to be duplicated, thus wasting valuable design time. This means that the BCASH simulation must lead to the final partition.

    One critical aspect of designing large, complex systems is to partition the effort of coding and verifying the implementation of functional blocks. A testbench needs to be created to verify each one of the functional blocks. One of the strengths of the N2C approach is that the system simulation can include a mix of different levels of abstraction blocks mapped into the hardware partition. This makes it possible to selectively operate on the implementation of each single HW block, one following the other, and proceed in the refinement steps using the system-level simulation as a testbench for the block itself. Thus, there is no need to have the whole system specified at the RT level before starting the system validation.

    To avoid using the proprietary RTC language, not all the system modules were refined in the N2C environment. That means a mixed simulation with commercially available VHDL simulators was required, at the expense of much longer simulation times.

    We used an instruction-set simulator (ISS) for the processor. CoWare includes an ISS in the BCA simulator provided by the N2C tool; this is then connected to a VHDL simulator (ModelSim, for example). The resulting mixed N2C, ISS and VHDL simulation could not be used for extensive simulations because of reduced overall speed. Instead, we used N2C-generated VHDL code to integrate the full VHDL RTL description of the chip. We used hardware emulation for RTL sign-off.

    N2C generates a VHDL netlist for the entire system including the processor, buses, standard blocks and newly designed blocks. This overall netlist connects all functionality, so that you can then export it through a conventional RTL flow. Furthermore, we used netlist sign-off to certify the overall correctness of the VHDL produced by the CoWare N2C tool.

    With the VHDL co-simulation we tested our RTL code against the one produced with the CoWare interface synthesis.

    We found that the CoWare HDL co-simulation feature allowed us to further improve the path from higher to lower levels of abstraction, and vice versa.

    ---
    Acknowledgements
    The authors wish to acknowledge CoWare Europe for the support and assistance given to the speech-recognition design team.

    ---
    References
    Bolsens, De Man, Lin, Van Rompaey, Vercauteren and Verkest, "Hardware/software co-design of digital telecommunication systems," Proceedings of the IEEE, Vol. 85, No. 3, March 1997.

    Clement, Hersemeule, Lantreibecq, Ramanadin, Coulomb and Pogodalla, "Fast prototyping: a system-design flow applied to a complex system-on-chip multiprocessor design," Proceedings of the 36th Design Automation Conference, 1999.

    ISD Magazine, Alcatel story (www.isdmag.com/editorial/2000/design0008.html), Fujitsu story (www.isdmag.com/editorial/1999/coverstory9907.html).

    J.W. Picone, "Signal-modeling techniques in speech recognition," Proceedings of the IEEE, Vol. 81, September 1993.

    http://www.isdmag.com

    © 2001 CMP Media LLC.
    11/1/01, Issue # 13149, page 12.


     
